library(tidyverse)
library(GGally)
library(ggfortify)
Load the housing_prices.csv data set and undertake an initial exploration of the data.
housing <- read_csv("data/housing.csv")
Rows: 20640 Columns: 10── Column specification ─────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ocean_proximity
dbl (9): longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, me...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
housing_prices <- read_csv("data/housing_prices.csv")
Rows: 19675 Columns: 10── Column specification ─────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ocean_proximity
dbl (9): longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, me...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(housing)
Rows: 20,640
Columns: 10
$ longitude <dbl> -122.23, -122.22, -122.24, -122.25, -122.25, -122.25, -122.25, -122.25, -122.26,…
$ latitude <dbl> 37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37.84, 37.84, 37.84, 37.85, 37.…
$ housing_median_age <dbl> 41, 21, 52, 52, 52, 52, 52, 52, 42, 52, 52, 52, 52, 52, 52, 50, 52, 52, 50, 52, …
$ total_rooms <dbl> 880, 7099, 1467, 1274, 1627, 919, 2535, 3104, 2555, 3549, 2202, 3503, 2491, 696,…
$ total_bedrooms <dbl> 129, 1106, 190, 235, 280, 213, 489, 687, 665, 707, 434, 752, 474, 191, 626, 283,…
$ population <dbl> 322, 2401, 496, 558, 565, 413, 1094, 1157, 1206, 1551, 910, 1504, 1098, 345, 121…
$ households <dbl> 126, 1138, 177, 219, 259, 193, 514, 647, 595, 714, 402, 734, 468, 174, 620, 264,…
$ median_income <dbl> 8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.1200, 2.0804, 3.6912, …
$ median_house_value <dbl> 452600, 358500, 352100, 341300, 342200, 269700, 299200, 241400, 226700, 261100, …
$ ocean_proximity <chr> "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BA…
summary(housing)
longitude latitude housing_median_age total_rooms total_bedrooms population
Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2 Min. : 1.0 Min. : 3
1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448 1st Qu.: 296.0 1st Qu.: 787
Median :-118.5 Median :34.26 Median :29.00 Median : 2127 Median : 435.0 Median : 1166
Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636 Mean : 537.9 Mean : 1425
3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148 3rd Qu.: 647.0 3rd Qu.: 1725
Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320 Max. :6445.0 Max. :35682
NA's :207
households median_income median_house_value ocean_proximity
Min. : 1.0 Min. : 0.4999 Min. : 14999 Length:20640
1st Qu.: 280.0 1st Qu.: 2.5634 1st Qu.:119600 Class :character
Median : 409.0 Median : 3.5348 Median :179700 Mode :character
Mean : 499.5 Mean : 3.8707 Mean :206856
3rd Qu.: 605.0 3rd Qu.: 4.7432 3rd Qu.:264725
Max. :6082.0 Max. :15.0001 Max. :500001
We expect the total_rooms of houses to be strongly correlated with total_bedrooms. Use ggpairs() to investigate correlations between these two variables.
housing %>%
select(total_rooms, total_bedrooms) %>%
ggpairs(progress = FALSE)
So, we do find significant correlations. Let’s drop total_bedrooms from the dataset, and use only total_rooms going forward.
housing_trim <- housing %>%
select(-total_bedrooms)
housing_prices_trim <- housing_prices %>%
select(-total_bedrooms)
We could engineer some new variables.
housing_trim_eng <- housing_trim %>%
mutate(rooms_per_house = total_rooms / households,
people_per_house = population / households,
rooms_per_person = total_rooms / population)
housing_trim_eng
housing_prices_eng <- housing_prices_trim %>%
mutate(rooms_per_house = total_rooms / households,
people_per_house = population / households,
rooms_per_person = total_rooms / population)
housing_prices_eng
NA
NA
And take a look at these using skim() - which when we
do, we can see that a lot of the variables are positively skewed.
Positively skewed data is not good for linear models.
housing_trim_eng %>%
skimr::skim() %>%
view()
Where we have skewed data that is non-negative, we can log transform it to make it more normal.
housing_prices_eng %>%
ggplot(aes(x = median_income)) +
geom_histogram()
housing_prices_eng %>%
ggplot(aes(x = median_house_value)) +
geom_histogram()
housing_log <- housing_prices_eng %>%
select(-c(ocean_proximity, latitude, longitude, housing_median_age)) %>%
mutate(across(everything(), log)) %>%
rename_with(~ paste0("log_", .x)) %>%
bind_cols(housing_prices_eng)
housing_log %>%
ggplot(aes(x = log_median_income)) +
geom_histogram()
housing_log %>%
ggplot(aes(x = log_median_house_value)) +
geom_histogram()
With values now transformed and more normal. Lets have a look for some relationships.
housing_log %>%
ggplot(aes(x = median_income, y = median_house_value)) +
geom_point()
housing_log %>%
ggplot(aes(x = ocean_proximity, y = median_house_value)) +
geom_boxplot()
There appears to be some kind of relationship between ocean proximity and house value.
housing_log %>%
ggplot(aes(x = longitude, y = latitude, colour = ocean_proximity)) +
geom_point()
Lets group some of those levels!
housing_ocean <- housing_log %>%
mutate(ocean_prox_grouped = if_else(
ocean_proximity %in% c("<1H OCEAN", "NEAR BAY", "NEAR OCEAN"), "NEAR WATER", ocean_proximity))
housing_ocean
housing_ocean %>%
ggplot(aes(x = longitude, y = latitude, colour = ocean_prox_grouped)) +
geom_point()
housing_ocean %>%
select(log_median_house_value,
log_total_rooms,
log_population,
log_households,
log_rooms_per_house,
log_people_per_house,
log_rooms_per_person) %>%
ggpairs(progress = FALSE)
housing_ocean %>%
select(log_median_house_value, housing_median_age, log_median_income, ocean_prox_grouped) %>%
ggpairs(aes(colour = ocean_prox_grouped,
alpha = 0.5),
progress = FALSE)
model1 <- lm(log_median_house_value ~ log_median_income,
data = housing_ocean)
autoplot(model1)
summary(model1)
Call:
lm(formula = log_median_house_value ~ log_median_income, data = housing_ocean)
Residuals:
Min 1Q Median 3Q Max
-2.59210 -0.26868 0.00162 0.25743 2.22388
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.094953 0.008498 1305.6 <2e-16 ***
log_median_income 0.776362 0.006600 117.6 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4086 on 19673 degrees of freedom
Multiple R-squared: 0.4129, Adjusted R-squared: 0.4129
F-statistic: 1.384e+04 on 1 and 19673 DF, p-value: < 2.2e-16
Once have a Simple Linear Regression we want to identify which predictors to add next.
We do this by considering our residuals.
library(modelr)
housing_ocean %>%
add_residuals(model = model1) %>% # Adds redis column, difference from actual to fitted median_house_value
select(log_median_house_value,
ocean_prox_grouped,
log_rooms_per_person,
log_people_per_house,
resid) %>%
ggpairs(progress = FALSE)
model1 <- lm(log_median_house_value ~ log_median_income,
data = housing_ocean)
autoplot(model1)
We are interested in developing a regression model for the median_house_value of a house in terms of the possible predictor variables in the dataset.
Use ggpairs() to investigate correlations between median_house_value and the predictors (this may take a while to run, don’t worry, make coffee or something).
ggpairs(housing_trim, progress = FALSE)
Perform further ggplot visualisations of any significant correlations you find.
ggpairs() shows a strong correlation between
median_house_value and median_income, with a
Correlation Coefficient of 0.688.
This is visualised below.
housing_trim %>%
ggplot(aes(x = median_income, y = median_house_value)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Shortly we may try a regression model to fit the categorical predictor ocean_proximity. Investigate the level of ocean_proximity predictors. How many dummy variables do you expect to get from it?
There are 5 levels of ocean_proximity so we would use 4
dummies.
housing_trim %>%
distinct(ocean_proximity)
Start with simple linear regression. Regress median_house_value on median_income and check the regression diagnostics.
housing_model <- lm(median_house_value ~ median_income,
data = housing_trim)
autoplot(housing_model)
Plot 1: I’m unsure of the correct interpretation of this; however, the smoothed line suggests no clear pattern to distribution staying relatively flat and close to 0. To the my eye though it looks as though lower fitted values are over-estimated and higher fitted values are under-estimated. Plot 2: There is some variation from the normal distribution, although the residuals are relatively normally distributed. Plot 3: I’m unsure on the correct interpretation of this; however, the smoothed line stays relatively constant/flat although to my eye there looks to be funneling at the higher end of the fitted values.
Add another predictor of your choice. Check your assumptions, diagnostics, and interpret the model.
housing_trim %>%
ggplot(aes(x = median_house_value, y = ocean_proximity))+
geom_boxplot()
housing_model_2 <- lm(median_house_value ~ median_income + ocean_proximity,
data = housing_trim)
autoplot(housing_model_2)
My interpretations here are the same as the first model, although again I’m a little unsure if these are correct.
Plot 1: I’m unsure of the correct interpretation of this; however, the smoothed line suggests no clear pattern to distribution staying relatively flat and close to 0. To the my eye though it looks as though lower fitted values are over-estimated and higher fitted values are under-estimated. Plot 2: There is some variation from the normal distribution, although the residuals are relatively normally distributed. Plot 3: I’m unsure on the correct interpretation of this; however, the smoothed line stays relatively constant/flat although to my eye there looks to be funneling at the higher end of the fitted values.
summary(housing_model_2)
Call:
lm(formula = median_house_value ~ median_income + ocean_proximity,
data = housing_trim)
Residuals:
Min 1Q Median 3Q Max
-507454 -46266 -12876 28563 475483
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 83470.0 1414.3 59.019 < 2e-16 ***
median_income 37018.7 279.6 132.384 < 2e-16 ***
ocean_proximityINLAND -77457.4 1232.8 -62.829 < 2e-16 ***
ocean_proximityISLAND 195375.2 33139.8 5.895 3.79e-09 ***
ocean_proximityNEAR BAY 21267.6 1731.2 12.285 < 2e-16 ***
ocean_proximityNEAR OCEAN 17675.1 1633.7 10.819 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 74080 on 20634 degrees of freedom
Multiple R-squared: 0.588, Adjusted R-squared: 0.5879
F-statistic: 5890 on 5 and 20634 DF, p-value: < 2.2e-16
The Adjusted R-squared value is 0.5879 meaning that together the predictors of median_income and ocean_proximity explain 69% of median_house_value variation.
The p-values for each of the individual predictors are very low, suggesting that all predictors statistically significant.